String Matching in the DNA Alphabet

Authors

Jorma Tarhio

Hannu Peltola

Abstract

Searching for occurrences of string patterns is a common problem in many applications. Various good solutions have been presented for string matching. The most efficient solutions in practice are based on the Boyer–Moore algorithm [1]. A typical question in molecular biology is whether a given sequence has appeared elsewhere. In the following, we concentrate on searching for exact occurrences of long patterns in the DNA alphabet, which typically contains four characters: a, c, g, and t.

Biologists, however, are often interested in finding similar sequences rather than exact copies. Nevertheless, exact searching can be used as a fast subroutine of approximate searching: at low error levels, any algorithm for exact searching can serve as a fast filtering method. Assume that we allow e errors. If we divide the pattern into e+1 distinct, non-overlapping blocks, every approximate occurrence contains an exact occurrence of at least one of the blocks, because e errors can affect at most e blocks. Thus an occurrence of any block defines a potential approximate occurrence of the pattern, which can be checked with a slower dynamic programming method.

Hume and Sunday [3] review several techniques for improving the practical efficiency of the Boyer–Moore algorithm using different shift heuristics, tight loops, loop unrolling, and some other approaches. Their study mainly deals with searching for words in English text, but they also tested their algorithms on DNA strings. Later, Kim and Shawe-Taylor presented more efficient solutions for DNA strings based on an implementation of an algorithm introduced by Baeza-Yates [5]. We will introduce a new version of the Baeza-Yates algorithm [4, 5], which is a modification of the Boyer–Moore–Horspool algorithm [6] for small alphabets. In the Baeza-Yates algorithm the ...
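The block-partitioning filter described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact search is a plain Boyer–Moore–Horspool scan (not the tuned Baeza-Yates variant the paper develops), "errors" are taken to be edit operations, and the function names `horspool_search`, `sellers_ends`, and `approx_search` are hypothetical.

```python
def horspool_search(text, pattern):
    """Return start indices of all exact occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: for each character in pattern[:-1], the distance from
    # its last occurrence to the end of the pattern.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the character aligned with the last pattern position.
        pos += shift.get(text[pos + m - 1], m)
    return hits


def sellers_ends(text, pattern, e):
    """Dynamic programming check: end positions j (exclusive) such that
    some substring text[s:j] is within edit distance e of pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # an occurrence may start at any text position
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                       # skip text character
                      col[i - 1] + 1,                   # skip pattern character
                      prev_diag + (pattern[i - 1] != ch))  # match or substitute
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= e:
            ends.append(j)
    return ends


def approx_search(text, pattern, e):
    """Filter with exact block search, then verify candidates with DP."""
    m = len(pattern)
    if m <= e:
        raise ValueError("pattern must be longer than the error bound")
    size = m // (e + 1)  # block length; the last block takes the remainder
    ends = set()
    for b in range(e + 1):
        off = b * size
        block = pattern[off:] if b == e else pattern[off:off + size]
        for p in horspool_search(text, block):
            # Any occurrence containing this block hit lies in this window.
            lo = max(0, p - off - e)
            hi = min(len(text), p - off + m + e)
            for j in sellers_ends(text[lo:hi], pattern, e):
                ends.add(lo + j)
    return sorted(ends)
```

For example, `horspool_search("acgtacgtacg", "acg")` returns `[0, 4, 8]`. The filter only runs the slower dynamic programming step on short windows around block hits, so at low error levels most of the text is handled by the fast exact scan.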


Similar resources

Selecting the Shortest Superstring in DNA Using the Particle Swarm Optimization Algorithm

A DNA string can be regarded as a very long string over a four-letter alphabet. Numerous scientists have attempted to decode this string. Because the string is very long, shorter, mutually overlapping sections of it are decoded instead. There is no information about the correct position of these sections on the main DNA string. It seems that the shortest string (substring of the main DNA string) i...


On-line string matching algorithms: survey and experimental results

In this paper we present a short survey and experimental results for well-known sequential string matching algorithms. We consider algorithms based on different approaches, including classical, suffix automata, bit-parallelism, and hashing. We put special emphasis on recently presented algorithms such as Shift-Or and BNDM. We compare these algorithms in terms of the number of character...


Approximate String Matching with Reduced Alphabet

We present a method to speed up approximate string matching by mapping the factual alphabet to a smaller alphabet. We apply the alphabet reduction scheme to a tuned version of the approximate Boyer–Moore algorithm utilizing the Four-Russians technique. Our experiments show that the alphabet reduction makes the algorithm faster. Especially in the k-mismatch case, the new variation is faster tha...


A Fast Generic Sequence Matching Algorithm

A string matching algorithm (and, more generally, a sequence matching algorithm) is presented that has a linear worst-case computing time bound, a low worst-case bound on the number of comparisons (2n), and sublinear average-case behavior that is better than that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its efficiency advantages in a wide variety of sequence matching problems...


Exact Multiple String Matching Problem for DNA Alphabet

Given a text T = t1t2 ... tn and a set of patterns P = {P1, P2, ..., Pr}, the exact multiple string matching problem (EMSMP) is to find the ending positions of all substrings of T that are equal to Pi for 1 ≤ i ≤ r. We regard all substrings in T and patterns in P as data points in an edit distance-based metric space. The data points in T are constructed into a vantage point tree (vp-tree) T. Then, ...


An Efficient Composite-Alphabet Transform for String Matching under a Restricted Alphabet Set

String matching is the problem of finding all occurrences of a short pattern in a relatively long reference string. While a number of methods have been presented, most published implementations assume several restrictions for practical reasons. We focus on the restriction of the alphabet size, which is usually set to 256 in many string matching libraries. When strings must be handled ov...



Journal: Softw., Pract. Exper.

Volume: 27

Publication date: 1997
